Base 64 Encoding
INI_WriteArray and other functions, most notably those in other libraries, use a custom version of base 64 encoding to store their data as compactly as possible in an ASCII file.

Encoding

Consider an array of 20 cells, or 80 bytes of data. The longest numbers that can be stored in a single cell in PAWN, when written out in decimal, are -1000000000 to -2147483647 (technically -2147483648, but this number is poorly defined), at 11 characters each. Writing the data out in this manner, assuming a separator of ',', would take up to 12 * 20 = 240 characters, or 240 bytes in ASCII: 3 times the original data size.

At the other end, the data can NOT be stored just as-is because of the presence of bytes like '\0' or '\n', which would break many storage schemes ('\n' is a new line in ASCII, so if any byte of the data contained the number 10 this would break the storage when using files).

Instead of converting all 8 bits of a byte into a character, just 6 of the bits are used, and the other two are carried over to the next character. Consider the number 123456; in hex this is 0x0001E240. Split into four 8-bit parts, this is:

    0b00000000 0b00000001 0b11100010 0b01000000

However, if we split this data up into (about) five 6-bit parts we get:

    0b000000 0b000000 0b000111 0b100010 0b010000 0b00

Now, 0b000000 is still NULL, and other characters are still invalid, but this is where the concept of base 64 comes from. Because there are 64 different numbers in 6 bits, we need a range of 64 consecutive valid ASCII characters; one such range runs from '>' (62) to '}' (125). Although lower ranges are possible, they include the equals and semi-colon signs, which can cause problems when reading INI files if not used correctly (especially the latter, which is a comment symbol).
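The split into 6-bit groups and the choice of character range can be checked with a short sketch. It is written in Python purely for illustration (the library itself is PAWN), and the names split_cell, FIRST_CHAR and LAST_CHAR are hypothetical, not taken from the library.

```python
# Illustrative Python sketch, not the library's PAWN code.
FIRST_CHAR = 62   # '>' : first character of the 64-character range
LAST_CHAR = 125   # '}' : last character of the range


def split_cell(value):
    """Split one 32-bit cell into 6-bit groups, most significant bits first.

    Returns the five complete groups and the two leftover bits as strings.
    """
    bits = format(value & 0xFFFFFFFF, "032b")
    groups = [bits[i:i + 6] for i in range(0, 30, 6)]
    leftover = bits[30:]
    return groups, leftover


groups, leftover = split_cell(123456)
print(groups, leftover)

# The candidate range holds exactly 64 characters and avoids the
# characters that are awkward in INI files.
alphabet = [chr(c) for c in range(FIRST_CHAR, LAST_CHAR + 1)]
print(len(alphabet), "=" in alphabet, ";" in alphabet)
```

Working on a bit string keeps the grouping visible; a real implementation would more likely use shifts and masks.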
So if we add 62 to all possible 6-bit numbers, we get valid ASCII characters that can be written to a file without any problems:

    0b000000 + 62 = 62 ('>')
    0b000000 + 62 = 62 ('>')
    0b000111 + 62 = 69 ('E')
    0b100010 + 62 = 96 ('`')
    0b010000 + 62 = 78 ('N')
    0b00 (N/A)

Using this scheme a 20 cell (80 byte) array takes 107 (rounded up) bytes to store in ASCII: more than the 80 bytes required for binary (which is not possible here), but less than half of what is required to print out every possible number. This gives output looking something like:

    >>>a>>>>>>>{>>>K_>>>>>>N>>>>>>>>>>>I>>>r>>>>>d>>>>>>>>

NULLs

The vast number of '>' relative to any other character is due to the relative abundance of bytes that are all zero in most binary data. This lends itself nicely to a further compression technique: since the ASCII digits ('0'-'9') are NOT valid base 64 encoding characters in this scheme, a run of multiple '>'s could be replaced by the number of characters removed:

    3a7{3K_6N11I3r5d8

Note that this storage optimisation is not yet implemented.

Multiple Cells

The example above neatly ignored the problem of the two extra bits. When dealing with only one cell, these are padded to 6 bits with 0b0000; when dealing with multiple cells, they are effectively prepended to the next cell to give a 34-bit number. Unfortunately, when this larger number is written, it again takes 5 ASCII characters, and this time leaves 4 bits remaining. Again, if this is the last cell, the remainder is padded to 6 bits with 0b00; however, if there is another cell the bits are again carried over, and this time the result is a 36-bit number that can be written in its entirety using 6 ASCII characters.

In short, 3 cells (12 bytes) can be exactly encoded in 16 ASCII characters. This ratio is always exact and means that the encoded data is always about 1.33 times the size of the original data (rounded up, and prior to any data compression).

Writing

y_ini has a function INI_WriteArray which performs the encoding above and stores the results to a file.
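The encoding described above, including the carry-over of leftover bits between cells, can be sketched as follows. This is Python for illustration only, not the PAWN code in y_ini; encode_cells, compress_runs and expand_runs are hypothetical names, and the run-length step only illustrates the idea that the text says is not yet implemented.

```python
import re

OFFSET = 62  # '>' stands for the 6-bit value 0


def encode_cells(cells):
    """Encode 32-bit cells as text, 6 bits per character, offset by 62."""
    out = []
    acc = 0     # accumulator holding bits not yet written
    nbits = 0   # number of valid bits in acc
    for cell in cells:
        acc = (acc << 32) | (cell & 0xFFFFFFFF)
        nbits += 32
        while nbits >= 6:
            nbits -= 6
            out.append(chr(((acc >> nbits) & 0x3F) + OFFSET))
        acc &= (1 << nbits) - 1  # keep only the 0, 2 or 4 leftover bits
    if nbits:
        # Last cell: pad the leftover bits with zeros on the right.
        out.append(chr(((acc << (6 - nbits)) & 0x3F) + OFFSET))
    return "".join(out)


def compress_runs(text):
    """Hypothetical: replace each run of '>' with its decimal length."""
    return re.sub(r">+", lambda m: str(len(m.group(0))), text)


def expand_runs(text):
    """Hypothetical inverse: turn each decimal number back into '>'s."""
    return re.sub(r"[0-9]+", lambda m: ">" * int(m.group(0)), text)


print(encode_cells([123456]))           # >>E`N>
print(len(encode_cells([1, 2, 3])))     # 16
print(len(encode_cells([0] * 20)))      # 107
print(compress_runs(encode_cells([0, 0, 0])))  # 16
```

A single accumulator handles the 2-bit and 4-bit carries uniformly, so no special case is needed for the second and third cell of each group of three.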
However, y_ini has a static line length limit, so a slight change is required to split the data over multiple lines. For an array of 26 cells written under the name MyData, this gives data something like:

    MyData = 26
    @@MyData-0 = >>>>HbG
    @@MyData-1 = >>>>>>>>>>>

First the length is written to the file so that any loading code knows how much data to expect. Next the data is written out, but split over multiple lines due to system limits. Each data key ends with a number to denote which part it is, and starts with @@ to reduce the chances of conflicts with other keys (this is further reduced by both libraries using custom INI tags).

Optimisation

As already mentioned, 3 cells can be stored in 16 characters, so the code is written with this in mind and has dedicated code for each of those 16 characters, in the form of a series of write[wi++] lines that are fairly obvious in the function.

Reading

This is the reverse of writing, but is not currently documented. In the loading code, the small bit of #emit code is just used to load the address of the storage location, and is not normally required if the target array is known. The main if checks the key for the presence of '-': if it is found, the line is data; if it isn't, the line is the length of the data, and the name is used to find the correct storage location. This part may change if the base 64 data is stored with other values. The remainder of the code mirrors what was said about the writing, dealing with three cells at once to cover a complete set of 16 ASCII characters.
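The reading process can be sketched in the same spirit. This is an illustrative Python version, not the library's PAWN loader: decode_text and read_array are hypothetical names, and the sketch assumes that joining the data lines in index order gives one continuous character stream, which the text does not state explicitly.

```python
OFFSET = 62  # '>' stands for the 6-bit value 0


def decode_text(text, count):
    """Decode up to `count` signed 32-bit cells from offset-62 text."""
    cells = []
    acc = 0     # accumulator holding bits not yet turned into a cell
    nbits = 0   # number of valid bits in acc
    for ch in text:
        value = ord(ch) - OFFSET
        if not 0 <= value < 64:
            raise ValueError("invalid character: %r" % ch)
        acc = (acc << 6) | value
        nbits += 6
        if nbits >= 32:
            nbits -= 32
            cell = (acc >> nbits) & 0xFFFFFFFF
            acc &= (1 << nbits) - 1      # keep bits belonging to the next cell
            if cell >= 0x80000000:       # reinterpret as a signed cell
                cell -= 0x100000000
            cells.append(cell)
            if len(cells) == count:
                break                    # any remaining bits are padding
    return cells


def read_array(lines, name):
    """Collect the length line and the '@@name-N' data lines, then decode."""
    length = None
    chunks = {}
    prefix = "@@" + name + "-"
    for line in lines:
        key, sep, value = line.partition("=")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key.startswith(prefix):       # key contains '-': a data line
            chunks[int(key[len(prefix):])] = value
        elif key == name:                # no '-': the length line
            length = int(value)
    if length is None:
        raise KeyError(name)
    # Assumption: chunks are simply concatenated in index order.
    text = "".join(chunks[i] for i in sorted(chunks))
    return decode_text(text, length)


print(decode_text(">>E`N>", 1))  # [123456]
```

Splitting on the first '=' is safe here because '=' (61) lies just below the range of characters used for the data.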